Abstract

In "Li, L. and Yin, X. (2008). Sliced Inverse Regression with Regularizations.
Biometrics, 64(1):124-131" a ridge SIR estimator is introduced as the solution
of a minimization problem and computed with an alternating least-squares
algorithm. This methodology performs well in practice. In this note, we focus
on the theoretical properties of the estimator. It is shown that the
minimization problem is degenerate in the sense that only two situations can
occur: either the ridge SIR estimator does not exist or it is zero.

Keywords: Inverse regression, regularization, sufficient dimension re- 
duction. 
1 Introduction



Many methods have been developed for inferring the conditional distribution of
a univariate response $Y$ given a predictor $X$ in $\mathbb{R}^p$. When $p$ is
large, sufficient dimension reduction aims at replacing the predictor $X$ by its
projection onto a subspace of smaller dimension without loss of information on
the conditional distribution of $Y$ given $X$. In this context, the central
subspace, denoted by $S_{Y|X}$, plays an important role. It is defined as the
smallest subspace such that, conditionally on the projection of $X$ on
$S_{Y|X}$, $Y$ and $X$ are independent. In other words, the projection of $X$ on
$S_{Y|X}$ contains all the information on $Y$ that is available in the predictor
$X$. Introducing $d = \dim(S_{Y|X})$ and $A \in \mathbb{R}^{p \times d}$ such
that $S_{Y|X} = \mathrm{Span}(A)$, this property can be rewritten in terms of
conditional distribution functions as

$$F(Y \mid X) = F(Y \mid A^T X).$$

The estimation of $A$ has received considerable attention, and among the
proposed methods, Sliced Inverse Regression (SIR) [I] seems to be the most
popular one. Let us recall its definition from the minimum discrepancy point of
view [HI2]. Starting from an $n$-sample, denoting by $\bar{X}$ the sample mean of
$X$ and by $\hat{\Sigma}_x$ the sample covariance matrix of $X$, and assuming
that the response variable $Y$ is partitioned into $h$ non-overlapping slices,
the SIR estimator of $A$ is obtained by minimizing

$$G(A, C) = \sum_{y=1}^{h} f_y \left( (\bar{X}_y - \bar{X}) - \hat{\Sigma}_x A C_y \right)^T \hat{\Sigma}_x^{-1} \left( (\bar{X}_y - \bar{X}) - \hat{\Sigma}_x A C_y \right), \tag{1}$$

where $f_y = n_y / n$, $n_y$ is the number of observations in the $y$th slice,
$\bar{X}_y$ is the average of $X$ in the $y$th slice and
$C = (C_1, \ldots, C_h) \in \mathbb{R}^{d \times h}$. Defining

$$\hat{\Gamma} = \sum_{y=1}^{h} f_y (\bar{X}_y - \bar{X})(\bar{X}_y - \bar{X})^T,$$

an estimator of $\mathrm{cov}(E(X \mid Y))$, the SIR estimator is obtained by
computing the eigenvectors of $\hat{\Sigma}_x^{-1} \hat{\Gamma}$ associated to
the $d$ largest eigenvalues. It thus requires the inversion of $\hat{\Sigma}_x$,
which is not possible as soon as $p > n$ or when the predictors are highly
correlated. In order to overcome this problem, it has been proposed to use the
ridge SIR estimator ([5], Definition 1) defined as follows. Let $\tau > 0$ and

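$$G_\tau(A, C) = \sum_{y=1}^{h} f_y \left\| (\bar{X}_y - \bar{X}) - \hat{\Sigma}_x A C_y \right\|^2 + \tau \, \|\mathrm{vec}(A)\|^2, \tag{2}$$

where $\mathrm{vec}(\cdot)$ is the matrix operator that stacks all columns of a
matrix into a single vector. The ridge SIR estimator of the central subspace
$S_{Y|X}$ is $\mathrm{Span}(\hat{A})$ where

$$(\hat{A}, \hat{C}) = \arg\min_{A, C} G_\tau(A, C). \tag{3}$$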
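The quantities entering the SIR criterion can be sketched numerically as follows. This is only an illustrative NumPy sketch: the function names `sir_quantities` and `sir_directions`, the equal-count slicing rule, the $1/n$ covariance normalisation and the toy linear model are all assumptions made for the example, not taken from the references.

```python
import numpy as np

def sir_quantities(X, y, h):
    """Return slice proportions f, centred slice means V (h x p), Sigma_hat and Gamma_hat."""
    n, p = X.shape
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    sigma = Xc.T @ Xc / n                      # 1/n normalisation is an assumption
    # Assumption: h slices of (nearly) equal size, formed from the sorted response.
    slices = np.array_split(np.argsort(y), h)
    f = np.array([len(idx) / n for idx in slices])
    V = np.stack([X[idx].mean(axis=0) - x_bar for idx in slices])
    gamma = (V * f[:, None]).T @ V             # sum_y f_y v_y v_y^T
    return f, V, sigma, gamma

def sir_directions(sigma, gamma, d):
    """Eigenvectors of Sigma_hat^{-1} Gamma_hat for the d largest eigenvalues.

    Requires an invertible Sigma_hat, which is exactly what fails when p > n.
    """
    evals, evecs = np.linalg.eig(np.linalg.solve(sigma, gamma))
    top = np.argsort(evals.real)[::-1][:d]
    return evecs[:, top].real

# Toy data (hypothetical): a linear model with one relevant direction, d = 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
beta = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta + 0.1 * rng.normal(size=500)
f, V, sigma, gamma = sir_quantities(X, y, h=10)
b = sir_directions(sigma, gamma, d=1)[:, 0]
cosine = abs(b @ beta) / np.linalg.norm(b)     # alignment with the true direction
print(cosine)
```

The call to `np.linalg.solve(sigma, gamma)` is where the inversion of the sample covariance matrix enters, which is the step the ridge penalty is meant to avoid.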

From the practical point of view, an alternating least-squares algorithm is
proposed in [5] to solve this optimization problem. It performed well on
simulated and real data. Here, we focus on the theoretical aspects. To this
end, let us highlight that definition (3) assumes the existence of a unique
minimum of $G_\tau$. In Section 2, we prove that this is not the case. In fact,
either $\arg\min G_\tau = \emptyset$, and thus the ridge SIR estimator does not
exist, or $\arg\min G_\tau \subset \{0\} \times \mathbb{R}^{d \times h}$ and
consequently the ridge SIR estimator is zero. A modification of the criterion
(2) is proposed in Section 3, leading to the estimator of $A$ proposed in [7].
Proofs are postponed to the Appendix.

2 On the existence of the ridge SIR estimator 

Before stating our main result on the existence of the ridge SIR estimator,
remark that $G_\tau$ ($\tau > 0$) does not penalize in the same way two
proportional matrices $A$ and $\lambda A$, $\lambda \in \mathbb{R} \setminus \{0\}$,
although they define the same central subspace since
$\mathrm{Span}(A) = \mathrm{Span}(\lambda A)$. This lack of invariance may
explain why the ridge SIR estimator is ill-defined, as illustrated below.
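The lack of invariance can be checked numerically. In the sketch below, $A$ is rescaled by $\lambda$ and $C$ by $1/\lambda$, so that the product $A C_y$ and hence the fit term are unchanged while the penalty is multiplied by $\lambda^2$. The rescaling of $C$ is a choice made for this illustration, and the matrix `sigma`, the slice means `V` and the proportions `f` are arbitrary stand-ins rather than real data.

```python
import numpy as np

def ridge_sir_criterion(A, C, V, f, sigma, tau):
    """G_tau(A, C) = sum_y f_y ||v_y - sigma A C_y||^2 + tau ||vec(A)||^2.

    V is h x p with rows v_y = Xbar_y - Xbar; C is d x h with columns C_y.
    """
    resid = V.T - sigma @ A @ C                # p x h, column y is v_y - sigma A C_y
    return float(np.sum(f * np.sum(resid**2, axis=0)) + tau * np.sum(A**2))

rng = np.random.default_rng(0)
p, d, h, tau = 4, 2, 3, 0.5
B = rng.normal(size=(p, p))
sigma = B @ B.T / p                            # arbitrary symmetric stand-in for Sigma_hat
V = rng.normal(size=(h, p))
f = np.full(h, 1.0 / h)
A = rng.normal(size=(p, d))
C = rng.normal(size=(d, h))

fit = ridge_sir_criterion(A, C, V, f, sigma, 0.0)   # fit term only (tau = 0)
for lam in (1.0, 0.1, 0.01):
    g = ridge_sir_criterion(lam * A, C / lam, V, f, sigma, tau)
    # The excess over the fit term is tau * lam^2 * ||A||_F^2, which vanishes as lam -> 0.
    print(lam, g - fit, tau * lam**2 * np.sum(A**2))
```

Shrinking $A$ while inflating $C$ therefore always lowers the criterion without changing $\mathrm{Span}(A)$, which is the mechanism behind the degeneracy.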

Proposition 1. Let $\tau > 0$. If $\arg\min G_\tau \neq \emptyset$ then $\hat{A}$
defined by (3) is the zero $p \times d$ matrix. Moreover,

$$G_\tau(\hat{A}, C) = G_\tau(0, C) = \sum_{y=1}^{h} f_y \|\bar{X}_y - \bar{X}\|^2, \tag{4}$$

for all $C \in \mathbb{R}^{d \times h}$.

Since (4) does not depend on $C$, it follows that either
$\arg\min G_\tau = \emptyset$ or
$\arg\min G_\tau \subset \{0\} \times \mathbb{R}^{d \times h}$. The following
proposition makes it possible to distinguish between the two cases.

Proposition 2. Let $\tau > 0$ and assume $\mathrm{rank}(\hat{\Sigma}_x) \geq d$.
Then, $\arg\min G_\tau = \emptyset$ if and only if there exists
$y \in \{1, \ldots, h\}$ such that $\hat{\Sigma}_x (\bar{X}_y - \bar{X}) \neq 0$.
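A one-dimensional case makes both propositions concrete. This worked example is an added illustration. It assumes $p = d = 1$, $h \geq 2$, $f_y > 0$ for all $y$, and writes $s = \hat{\Sigma}_x > 0$ and $v_y = \bar{X}_y - \bar{X}$ (the symbols $s$ and $v_y$ are introduced only here). The criterion then reads

$$G_\tau(a, c) = \sum_{y=1}^{h} f_y (v_y - s\, a\, c_y)^2 + \tau a^2, \qquad a \in \mathbb{R},\ c = (c_1, \ldots, c_h) \in \mathbb{R}^{1 \times h}.$$

For any $a \neq 0$, choosing $c_y = v_y / (s a)$ cancels the fit term and gives $G_\tau(a, c) = \tau a^2$, so that $\inf G_\tau = 0$. If some $v_y \neq 0$, this infimum is not attained: $G_\tau(a, c) \geq \tau a^2 > 0$ whenever $a \neq 0$, and $G_\tau(0, c) = \sum_{y} f_y v_y^2 > 0$. Hence $\arg\min G_\tau = \emptyset$, in agreement with Proposition 2 since $s\, v_y \neq 0$. If instead $v_y = 0$ for all $y$, then

$$G_\tau(a, c) = a^2 \left( s^2 \sum_{y=1}^{h} f_y c_y^2 + \tau \right),$$

which is minimal exactly when $a = 0$, for every $c$, so that $\arg\min G_\tau = \{0\} \times \mathbb{R}^{1 \times h}$ and the estimator is zero, as stated in Proposition 1.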

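To solve the optimization problem (2), Li and Yin [5] proposed an alternating
least-squares algorithm. At iteration $k + 1$, given $A^{(k)}$, $C^{(k+1)}$ and
$A^{(k+1)}$ are updated as:

$$C_y^{(k+1)} = \left( A^{(k)T} \hat{\Sigma}_x^2 A^{(k)} \right)^{-1} A^{(k)T} \hat{\Sigma}_x (\bar{X}_y - \bar{X}), \quad y = 1, \ldots, h,$$

$$\mathrm{vec}(A^{(k+1)}) = \left\{ \sum_{y=1}^{h} f_y \left( C_y^{(k+1)T} \otimes \hat{\Sigma}_x \right)^T \left( C_y^{(k+1)T} \otimes \hat{\Sigma}_x \right) + \tau I_{pd} \right\}^{-1} \sum_{y=1}^{h} f_y \left( C_y^{(k+1)T} \otimes \hat{\Sigma}_x \right)^T (\bar{X}_y - \bar{X}).$$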
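The two updates above can be sketched in NumPy as follows. This is a minimal sketch written from the update formulas only, not the implementation of [5]: the function name `als_ridge_sir`, the fixed number of sweeps and the random test data are assumptions, and $\mathrm{vec}$ is taken to be column-stacking (`order="F"`).

```python
import numpy as np

def als_ridge_sir(V, f, sigma, tau, A0, n_iter=50):
    """Alternate the C- and A-updates of the ridge SIR criterion.

    V is h x p with rows v_y = Xbar_y - Xbar. Returns the final (A, C) and the
    value of G_tau after each sweep.
    """
    h, p = V.shape
    d = A0.shape[1]
    A = A0.copy()
    history = []
    for _ in range(n_iter):
        # C-step: C_y = (A^T sigma^2 A)^{-1} A^T sigma v_y for every slice y.
        SA = sigma @ A                                   # p x d
        C = np.linalg.solve(SA.T @ SA, SA.T @ V.T)       # d x h
        # A-step: ridge system in a = vec(A), with K_y = C_y^T kron sigma (p x pd).
        lhs = tau * np.eye(p * d)
        rhs = np.zeros(p * d)
        for y in range(h):
            K = np.kron(C[:, y][None, :], sigma)
            lhs += f[y] * K.T @ K
            rhs += f[y] * K.T @ V[y]
        A = np.linalg.solve(lhs, rhs).reshape((p, d), order="F")  # undo column stacking
        resid = V.T - sigma @ A @ C
        history.append(float(np.sum(f * np.sum(resid**2, axis=0)) + tau * np.sum(A**2)))
    return A, C, history

# Arbitrary stand-in inputs (hypothetical, for illustration only).
rng = np.random.default_rng(0)
p, d, h, tau = 4, 2, 3, 0.1
B = rng.normal(size=(p, p))
sigma = B @ B.T / p + np.eye(p)                # positive definite stand-in for Sigma_hat
V = rng.normal(size=(h, p))
f = np.full(h, 1.0 / h)
A, C, history = als_ridge_sir(V, f, sigma, tau, rng.normal(size=(p, d)), n_iter=30)
print(history[0], history[-1], np.linalg.norm(A))
```

Each step minimizes $G_\tau$ exactly over one block, so the recorded criterion values cannot increase from one sweep to the next. Monitoring `np.linalg.norm(A)` over the sweeps is one way to observe how the iterates behave on a given data set.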

The authors claimed that such an algorithm converges. As a consequence of
Proposition 1, it is easily seen that the limit is always degenerate.

Corollary 1. Let $\tau > 0$ and denote by $(A^*, C^*)$ the limit of the sequence
$(A^{(k)}, C^{(k)})_k$. Necessarily, $A^*$ is the zero $p \times d$ matrix.

In view of this result, the good behavior of this algorithm on simulated and
real data reported in [5] (Sections 3 and 4) cannot be justified from a
theoretical point of view.




3 An alternative ridge SIR estimator 



It is possible to modify the criterion $G$ as follows:

$$H_\tau(A, C) = G(A, C) + \tau \sum_{y=1}^{h} f_y \|A C_y\|^2. \tag{5}$$

The first advantage of $H_\tau$ is to be invariant with respect to bijective
transformations, i.e.

$$H_\tau(AM, M^{-1}C) = H_\tau(A, C),$$

for every regular $d \times d$ matrix $M$. This property is natural since
$\mathrm{Span}(AM) = \mathrm{Span}(A)$. Second, it is readily seen that the
minimization of $H_\tau$ does not require the existence of
$\hat{\Sigma}_x^{-1}$ since $H_\tau$ can be rewritten as

$$H_\tau(A, C) - H_\tau(0, 0) = \sum_{y=1}^{h} f_y C_y^T A^T (\hat{\Sigma}_x + \tau I_p) A C_y - 2 \sum_{y=1}^{h} f_y (\bar{X}_y - \bar{X})^T A C_y.$$
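Both properties can be checked numerically from the expansion above. The sketch below evaluates $H_\tau(A, C) - H_\tau(0, 0)$ with a singular covariance matrix (obtained here from $n < p$ observations) and compares its value at $(A, C)$ and at $(AM, M^{-1}C)$. The function name `h_tau_centered` and all inputs are hypothetical stand-ins chosen for the example.

```python
import numpy as np

def h_tau_centered(A, C, V, f, sigma, tau):
    """H_tau(A, C) - H_tau(0, 0), computed from the expansion (no inverse of sigma needed)."""
    p = sigma.shape[0]
    AC = A @ C                                           # p x h, column y is A C_y
    quad = np.sum(f * np.sum(AC * ((sigma + tau * np.eye(p)) @ AC), axis=0))
    lin = np.sum(f * np.sum(V.T * AC, axis=0))
    return float(quad - 2.0 * lin)

rng = np.random.default_rng(0)
n, p, d, h, tau = 3, 5, 2, 2, 0.3
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)
sigma = Xc.T @ Xc / n                          # singular because p > n
V = rng.normal(size=(h, p))                    # stand-in for the centred slice means
f = np.full(h, 1.0 / h)
A = rng.normal(size=(p, d))
C = rng.normal(size=(d, h))
M = rng.normal(size=(d, d)) + 3.0 * np.eye(d)  # an arbitrary invertible d x d matrix

value = h_tau_centered(A, C, V, f, sigma, tau)
value_transformed = h_tau_centered(A @ M, np.linalg.solve(M, C), V, f, sigma, tau)
print(value, value_transformed)                # equal up to rounding
```

The invariance holds because the expansion depends on $(A, C)$ only through the products $A C_y$, which are unchanged by the transformation.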

Finally, remarking that the original criterion $G$ of SIR (1) can also be
expanded as

$$G(A, C) - G(0, 0) = \sum_{y=1}^{h} f_y C_y^T A^T \hat{\Sigma}_x A C_y - 2 \sum_{y=1}^{h} f_y (\bar{X}_y - \bar{X})^T A C_y,$$

it appears that $H_\tau(A, C) - H_\tau(0, 0)$ can be deduced from
$G(A, C) - G(0, 0)$ by substituting $\hat{\Sigma}_x + \tau I_p$ for
$\hat{\Sigma}_x$. Consequently, the estimator of $A$ obtained by minimizing (5)
is the Regularized SIR estimator introduced in [7], since its columns are the
eigenvectors of $(\hat{\Sigma}_x + \tau I_p)^{-1} \hat{\Gamma}$ associated to
the $d$ largest eigenvalues. In conclusion, the new functional (5) provides a
theoretical framework for the Regularized SIR estimator [7]. Thus, a
cross-validation criterion could be derived, similarly to (8) in [3], for
selecting the regularization parameter $\tau$.
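The eigenvector characterization of the regularized estimator can be sketched as follows. The code is an illustration under stated assumptions rather than the implementation of [7]: the function name `regularized_sir_directions`, the Cholesky-based symmetric reformulation of the eigenproblem, the equal-count slicing and the toy data with $p > n$ are all choices made for the example.

```python
import numpy as np

def regularized_sir_directions(sigma, gamma, tau, d):
    """Eigenvectors of (sigma + tau I)^{-1} gamma for the d largest eigenvalues.

    Uses sigma + tau I = L L^T and the symmetric matrix L^{-1} gamma L^{-T},
    which has the same eigenvalues; eigenvectors are mapped back by L^{-T}.
    """
    p = sigma.shape[0]
    L = np.linalg.cholesky(sigma + tau * np.eye(p))      # positive definite for tau > 0
    sym = np.linalg.solve(L, np.linalg.solve(L, gamma).T)
    evals, U = np.linalg.eigh(sym)                       # ascending order
    top = np.argsort(evals)[::-1][:d]
    return np.linalg.solve(L.T, U[:, top]), evals[top]

# Toy data (hypothetical) with more predictors than observations.
rng = np.random.default_rng(0)
n, p, h, d, tau = 20, 50, 4, 1, 1.0
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.1 * rng.normal(size=n)
x_bar = X.mean(axis=0)
Xc = X - x_bar
sigma = Xc.T @ Xc / n                                    # singular since p > n
slices = np.array_split(np.argsort(y), h)                # assumed slicing rule
V = np.stack([X[idx].mean(axis=0) - x_bar for idx in slices])
f = np.array([len(idx) / n for idx in slices])
gamma = (V * f[:, None]).T @ V
B, lam = regularized_sir_directions(sigma, gamma, tau, d)
print(B.shape, lam)
```

Unlike the classical SIR computation, nothing here requires the sample covariance matrix itself to be invertible, only $\hat{\Sigma}_x + \tau I_p$.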
Appendix



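Proof of Proposition 1. Let us remark that

$$G_\tau(A, C) = \sum_{y=1}^{h} f_y \left( \|\bar{X}_y - \bar{X}\|^2 - 2 (\bar{X}_y - \bar{X})^T \hat{\Sigma}_x A C_y + C_y^T A^T \hat{\Sigma}_x^2 A C_y \right) + \tau \, \|\mathrm{vec}(A)\|^2. \tag{6}$$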
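Using the equality (see for instance [3], Chapter 16, equation (2.13)),

$$\hat{\Sigma}_x A C_y = (C_y^T \otimes \hat{\Sigma}_x) \, \mathrm{vec}(A), \tag{7}$$

for all $y = 1, \ldots, h$, and denoting $a = \mathrm{vec}(A)$, we thus have:

$$G_\tau(A, C) = G_\tau^*(a, C) = \sum_{y=1}^{h} f_y \left\{ \|\bar{X}_y - \bar{X}\|^2 - 2 (\bar{X}_y - \bar{X})^T (C_y^T \otimes \hat{\Sigma}_x) a + a^T (C_y^T \otimes \hat{\Sigma}_x)^T (C_y^T \otimes \hat{\Sigma}_x) a \right\} + \tau \|a\|^2.$$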
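Identity (7) is easy to check numerically, provided $\mathrm{vec}$ is implemented as column stacking. The snippet below is a small sanity check with arbitrary random inputs; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
p, d = 4, 2
B = rng.normal(size=(p, p))
sigma = B @ B.T                                # arbitrary symmetric p x p matrix
A = rng.normal(size=(p, d))
c = rng.normal(size=d)                         # plays the role of one column C_y

a = A.reshape(-1, order="F")                   # vec(A): stack the columns of A
K = np.kron(c[None, :], sigma)                 # C_y^T kron sigma, shape (p, p*d)

lhs = sigma @ A @ c                            # sigma A C_y
rhs = K @ a                                    # (C_y^T kron sigma) vec(A)
print(np.max(np.abs(lhs - rhs)))               # agreement up to rounding
```

With NumPy's default row-major flattening (`order="C"`) the identity would not hold in this form, which is why `order="F"` is used.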
Suppose $\arg\min G_\tau \neq \emptyset$ and consider

$$(\hat{A}, \hat{C}) \in \arg\min_{A, C} G_\tau(A, C).$$

From [6], pp. 119-120, it follows that, necessarily, $(\hat{A}, \hat{C})$ is a
stationary point of $G_\tau$ and thus satisfies the set of equations:

$$\nabla_i G_\tau^*(\hat{a}, \hat{C}_1, \ldots, \hat{C}_h) = 0, \quad i = 1, \ldots, h + 1, \tag{8}$$

where $\hat{a} = \mathrm{vec}(\hat{A})$, $\hat{C} = (\hat{C}_1, \ldots, \hat{C}_h)$
and $\nabla_i$ denotes the gradient of $G_\tau^*$ with respect to its $i$th
argument, $i = 1, \ldots, h + 1$. Straightforward calculations lead to:

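$$\nabla_1 G_\tau^*(a, C_1, \ldots, C_h) = 2 \sum_{y=1}^{h} f_y \left\{ (C_y^T \otimes \hat{\Sigma}_x)^T (C_y^T \otimes \hat{\Sigma}_x) a - (C_y^T \otimes \hat{\Sigma}_x)^T (\bar{X}_y - \bar{X}) \right\} + 2 \tau a, \tag{9}$$

and, for $y = 1, \ldots, h$,

$$\nabla_{y+1} G_\tau^*(a, C_1, \ldots, C_h) = \nabla_{y+1} G_\tau(A, C_1, \ldots, C_h) = 2 f_y \left\{ A^T \hat{\Sigma}_x^2 A C_y - A^T \hat{\Sigma}_x (\bar{X}_y - \bar{X}) \right\}. \tag{10}$$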

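Thus, multiplying on the left by $C_y^T$ and using (7), it follows that

$$C_y^T \nabla_{y+1} G_\tau^*(a, C_1, \ldots, C_h) = 2 f_y \left[ C_y^T (A^T \hat{\Sigma}_x^2 A) C_y - C_y^T A^T \hat{\Sigma}_x (\bar{X}_y - \bar{X}) \right] = 2 f_y \left[ a^T (C_y^T \otimes \hat{\Sigma}_x)^T (C_y^T \otimes \hat{\Sigma}_x) a - a^T (C_y^T \otimes \hat{\Sigma}_x)^T (\bar{X}_y - \bar{X}) \right]. \tag{11}$$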



Hence, collecting (9) and (11), it appears that

$$a^T \nabla_1 G_\tau^*(a, C_1, \ldots, C_h) = \sum_{y=1}^{h} C_y^T \nabla_{y+1} G_\tau^*(a, C_1, \ldots, C_h) + 2 \tau \|a\|^2.$$



Since the regularization parameter $\tau$ is positive, condition (8) implies
$\|\hat{a}\|^2 = 0$, i.e. $\hat{A}$ is the zero $p \times d$ matrix. Replacing in
(6), we have

$$G_\tau(\hat{A}, C) = G_\tau(0, C) = \sum_{y=1}^{h} f_y \|\bar{X}_y - \bar{X}\|^2,$$

for all $C \in \mathbb{R}^{d \times h}$ and the result is proved. ■
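The key step of this proof, the relation between the gradient in $a$ and the gradients in the $C_y$, can be verified numerically at an arbitrary point. The sketch below codes the gradient formulas (9) and (10) directly. The function name `gradients` and the random inputs are illustrative assumptions, and the covariance matrix is taken symmetric.

```python
import numpy as np

def gradients(A, C, V, f, sigma, tau):
    """Return a = vec(A), the gradient in a (formula (9)) and the gradients in each C_y (formula (10))."""
    a = A.reshape(-1, order="F")                         # column-stacking vec
    g1 = 2.0 * tau * a
    gC = []
    SA = sigma @ A                                       # sigma symmetric, so SA.T @ SA = A^T sigma^2 A
    for y in range(C.shape[1]):
        K = np.kron(C[:, y][None, :], sigma)             # C_y^T kron sigma
        g1 = g1 + 2.0 * f[y] * (K.T @ (K @ a) - K.T @ V[y])
        gC.append(2.0 * f[y] * (SA.T @ SA @ C[:, y] - SA.T @ V[y]))
    return a, g1, gC

rng = np.random.default_rng(2)
p, d, h, tau = 4, 2, 3, 0.7
B = rng.normal(size=(p, p))
sigma = B @ B.T / p                                      # arbitrary symmetric stand-in
V = rng.normal(size=(h, p))                              # stand-in for Xbar_y - Xbar
f = np.array([0.2, 0.3, 0.5])
A = rng.normal(size=(p, d))
C = rng.normal(size=(d, h))

a, g1, gC = gradients(A, C, V, f, sigma, tau)
lhs = a @ g1
rhs = sum(C[:, y] @ gC[y] for y in range(h)) + 2.0 * tau * (a @ a)
print(lhs, rhs)                                          # equal up to rounding
```

At a stationary point both `g1` and every `gC[y]` vanish, so the identity forces $2 \tau \|a\|^2 = 0$, which is the argument used in the proof.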



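Proof of Corollary 1. The limit of the sequence satisfies the set of equations

$$C_y^* = \left( A^{*T} \hat{\Sigma}_x^2 A^* \right)^{-1} A^{*T} \hat{\Sigma}_x (\bar{X}_y - \bar{X}), \quad y = 1, \ldots, h,$$

$$\mathrm{vec}(A^*) = \left\{ \sum_{y=1}^{h} f_y \left( C_y^{*T} \otimes \hat{\Sigma}_x \right)^T \left( C_y^{*T} \otimes \hat{\Sigma}_x \right) + \tau I_{pd} \right\}^{-1} \sum_{y=1}^{h} f_y \left( C_y^{*T} \otimes \hat{\Sigma}_x \right)^T (\bar{X}_y - \bar{X}).$$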



Thus, from (10) it follows that

$$\nabla_{y+1} G_\tau(A^*, C_1^*, \ldots, C_h^*) = 0$$

for all $y = 1, \ldots, h$, while, from (9),

$$\nabla_1 G_\tau(A^*, C_1^*, \ldots, C_h^*) = 0.$$

Consequently, $(A^*, C^*)$ is a stationary point of $G_\tau$, and, following the
proof of Proposition 1, necessarily $A^*$ is the zero $p \times d$ matrix. ■

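Proof of Proposition 2. First, let us suppose that
$\hat{\Sigma}_x (\bar{X}_y - \bar{X}) = 0$ for all $y \in \{1, \ldots, h\}$. Then,

$$\begin{aligned}
G_\tau(A, C) &= \sum_{y=1}^{h} f_y \|\bar{X}_y - \bar{X}\|^2 + \sum_{y=1}^{h} f_y C_y^T A^T \hat{\Sigma}_x^2 A C_y + \tau \, \|\mathrm{vec}(A)\|^2 \\
&\geq \sum_{y=1}^{h} f_y \|\bar{X}_y - \bar{X}\|^2 \\
&= G_\tau(0, C),
\end{aligned}$$

which entails that $G_\tau(A, C)$ is minimum for every $C$ if $A$ is the zero
matrix. As a consequence, $\arg\min G_\tau \neq \emptyset$. This concludes the
first part of the proof.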
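Conversely, suppose there exists $y_0 \in \{1, \ldots, h\}$ such that
$\hat{\Sigma}_x (\bar{X}_{y_0} - \bar{X}) \neq 0$. Let $\tau > 0$ and let us
prove that there exist $A \in \mathbb{R}^{p \times d}$ and
$C \in \mathbb{R}^{d \times h}$ such that $G_\tau(A, C) < G_\tau(0, C)$. To this
end, let $q_i$, $i = 1, \ldots, p$, be orthonormal eigenvectors of
$\hat{\Sigma}_x$ associated to the eigenvalues $\lambda_i$, $i = 1, \ldots, p$.
Since

$$\hat{\Sigma}_x (\bar{X}_{y_0} - \bar{X}) = \sum_{i=1}^{p} \lambda_i q_i q_i^T (\bar{X}_{y_0} - \bar{X}) \neq 0,$$

there exists an eigenvector $q^*$ associated to an eigenvalue $\lambda^* > 0$
such that $q^* q^{*T} (\bar{X}_{y_0} - \bar{X}) \neq 0$. Thus
$\|(\bar{X}_{y_0} - \bar{X})^T q^*\| \neq 0$, and let $\varepsilon$ be such that:

$$0 < \varepsilon < \sqrt{\frac{f_{y_0}}{\tau d}} \, \|(\bar{X}_{y_0} - \bar{X})^T q^*\|. \tag{12}$$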



The matrices $A$ and $C$ are defined as follows. The first column of $A$ is the
vector $\varepsilon q^*$ and the $d - 1$ following columns of $A$ are the
vectors $\varepsilon q_i$, $i = 1, \ldots, d - 1$, where the $q_i$'s are
orthogonal eigenvectors (with unit norm) of $\hat{\Sigma}_x$ associated to
positive eigenvalues $\lambda_i$'s. Note that, since
$\mathrm{rank}(\hat{\Sigma}_x) \geq d$, such a matrix $A$ always exists. All the
columns of $C$ are chosen to be the null vector except the $y_0$th one defined
by:

$$C_{y_0} = \left( A^T \hat{\Sigma}_x^2 A \right)^{-1} A^T \hat{\Sigma}_x (\bar{X}_{y_0} - \bar{X}) = \frac{1}{\varepsilon} \left( \frac{q^*}{\lambda^*}, \frac{q_1}{\lambda_1}, \ldots, \frac{q_{d-1}}{\lambda_{d-1}} \right)^T (\bar{X}_{y_0} - \bar{X}).$$

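Such choices entail

$$\begin{aligned}
G_\tau(A, C) - G_\tau(0, C)
&= \sum_{y=1}^{h} f_y \left( C_y^T (A^T \hat{\Sigma}_x^2 A) C_y - 2 (\bar{X}_y - \bar{X})^T \hat{\Sigma}_x A C_y \right) + \tau \, \|\mathrm{vec}(A)\|^2 \\
&= f_{y_0} \left\{ C_{y_0}^T (A^T \hat{\Sigma}_x^2 A) C_{y_0} - 2 (\bar{X}_{y_0} - \bar{X})^T \hat{\Sigma}_x A C_{y_0} \right\} + \tau \, \|\mathrm{vec}(A)\|^2 \\
&= - f_{y_0} (\bar{X}_{y_0} - \bar{X})^T \hat{\Sigma}_x A (A^T \hat{\Sigma}_x^2 A)^{-1} A^T \hat{\Sigma}_x (\bar{X}_{y_0} - \bar{X}) + \tau \, \|\mathrm{vec}(A)\|^2 \\
&= - f_{y_0} \|(\bar{X}_{y_0} - \bar{X})^T q^*\|^2 - f_{y_0} \sum_{i=1}^{d-1} \|(\bar{X}_{y_0} - \bar{X})^T q_i\|^2 + \tau d \varepsilon^2 \\
&\leq - f_{y_0} \|(\bar{X}_{y_0} - \bar{X})^T q^*\|^2 + \tau d \varepsilon^2 \\
&< 0,
\end{aligned}$$

from (12). Thus, $(0, C) \notin \arg\min G_\tau$ and taking account of
Proposition 1 yields $\arg\min G_\tau = \emptyset$. ■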
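The construction used in this proof can be reproduced numerically. The sketch below builds $A$ and $C$ from an eigendecomposition, takes $\varepsilon$ equal to half of the upper bound in (12), and compares $G_\tau(A, C)$ with $G_\tau(0, C)$. The function names, the rule used to pick $q^*$ (largest projection among eigenvectors with positive eigenvalue), the positivity threshold and the random inputs are assumptions made for this illustration.

```python
import numpy as np

def ridge_sir_criterion(A, C, V, f, sigma, tau):
    """G_tau(A, C) with V (h x p) holding the centred slice means as rows."""
    resid = V.T - sigma @ A @ C
    return float(np.sum(f * np.sum(resid**2, axis=0)) + tau * np.sum(A**2))

def proof_construction(V, f, sigma, tau, d, y0):
    """Build A = eps * [q*, q_1, ..., q_{d-1}] and C (zero except column y0) as in the proof.

    Assumes rank(sigma) >= d and sigma @ V[y0] != 0.
    """
    lam, Q = np.linalg.eigh(sigma)                       # orthonormal eigenvectors
    v = V[y0]
    proj = Q.T @ v
    pos = lam > 1e-10 * lam.max()                        # numerically positive eigenvalues
    star = int(np.argmax(np.where(pos, np.abs(proj), -1.0)))
    others = [int(i) for i in np.argsort(lam)[::-1] if pos[i] and i != star][: d - 1]
    cols = [star] + others
    bound = np.sqrt(f[y0] / (tau * d)) * abs(proj[star]) # right-hand side of (12)
    eps = 0.5 * bound                                    # any value in (0, bound) would do
    A = eps * Q[:, cols]
    C = np.zeros((d, V.shape[0]))
    C[:, y0] = (Q[:, cols].T @ v) / (eps * lam[cols])
    return A, C, eps, proj[cols]

rng = np.random.default_rng(3)
p, d, h, tau, y0 = 5, 2, 3, 0.4, 1
B = rng.normal(size=(p, p))
sigma = B @ B.T / p                                      # arbitrary positive definite stand-in
V = rng.normal(size=(h, p))
f = np.full(h, 1.0 / h)

A, C, eps, proj_cols = proof_construction(V, f, sigma, tau, d, y0)
gap = ridge_sir_criterion(A, C, V, f, sigma, tau) - ridge_sir_criterion(np.zeros((p, d)), C, V, f, sigma, tau)
closed_form = -f[y0] * np.sum(proj_cols**2) + tau * d * eps**2
print(gap, closed_form)                                  # both negative and equal up to rounding
```

A negative `gap` means that $(0, C)$ is not a minimizer for this $C$, which is the fact the proof combines with Proposition 1.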